{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "layout: tutorial\n",
    "title: Datasets, Instances, and Fields\n",
    "id: datasets-instances-fields\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "\n",
    "Allennlp uses a hierarchical system of data structures to represent a Dataset which allow easy padding, batching and iteration. This tutorial will cover some of the basic concepts.\n",
    "\n",
    "\n",
    "At a high level, we use `DatasetReaders` to read a particular dataset into an `Iterable` of self-contained individual `Instances`, \n",
    "which are made up of a dictionary of named `Fields`. There are many types of `Fields` which are useful for different types of data, such as `TextField`, for sentences, or `LabelField` for representing a categorical class label. Users who are familiar with the `torchtext` library from `Pytorch` will find a similar abstraction here. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This cell just makes sure the library paths are correct. \n",
    "# You need to run this cell before you run the rest of this\n",
    "# tutorial, but you can ignore the contents!\n",
    "import os\n",
    "import sys\n",
    "module_path = os.path.abspath(os.path.join('../..'))\n",
    "if module_path not in sys.path:\n",
    "    sys.path.append(module_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create two of the most common `Fields`, imagining we are preparing some data for a sentiment analysis model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokens in TextField:  [This, movie, was, awful, !]\n",
      "Label of LabelField negative\n"
     ]
    }
   ],
   "source": [
    "from allennlp.data import Token\n",
    "from allennlp.data.fields import TextField, LabelField\n",
    "from allennlp.data.token_indexers import SingleIdTokenIndexer\n",
    "\n",
    "review = TextField(list(map(Token, [\"This\", \"movie\", \"was\", \"awful\", \"!\"])), token_indexers={\"tokens\": SingleIdTokenIndexer(namespace=\"token_ids\")})\n",
    "review_sentiment = LabelField(\"negative\", label_namespace=\"tags\")\n",
    "\n",
    "# Access the original strings and labels using the methods on the Fields.\n",
    "print(\"Tokens in TextField: \", review.tokens)\n",
    "print(\"Label of LabelField\", review_sentiment.label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we've made our `Fields`, we need to pair them together to form an `Instance`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fields in instance:  {'review': <allennlp.data.fields.text_field.TextField object at 0x10882e588>, 'label': <allennlp.data.fields.label_field.LabelField object at 0x1088322b0>}\n"
     ]
    }
   ],
   "source": [
    "from allennlp.data import Instance\n",
    "\n",
    "instance1 = Instance({\"review\": review, \"label\": review_sentiment})\n",
    "print(\"Fields in instance: \", instance1.fields)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... and once we've made our `Instance` s, we can just collect them in a `list`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from allennlp.data.dataset import Batch\n",
    "\n",
    "# Create another \n",
    "review2 = TextField(list(map(Token, [\"This\", \"movie\", \"was\", \"quite\", \"slow\", \"but\", \"good\", \".\"])), token_indexers={\"tokens\": SingleIdTokenIndexer(namespace=\"token_ids\")})\n",
    "review_sentiment2 = LabelField(\"positive\", label_namespace=\"tags\")\n",
    "instance2 = Instance({\"review\": review2, \"label\": review_sentiment2})\n",
    "\n",
    "instances = [instance1, instance2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to get our tiny sentiment analysis dataset ready for use in a model, we need to be able to do a few things: \n",
    "- Create a vocabulary from the dataset (using `Vocabulary.from_instances`)\n",
    "- Collect the instances into a `Batch` (which provides methods for indexing and converting to tensors)\n",
    "- Index the words and labels in the `Fields` to use the integer indices specified by the `Vocabulary`\n",
    "- Pad the instances to the same length\n",
    "- Convert them into tensors.\n",
    "\n",
    "The `Batch`, `Instance` and `Fields` have some similar parts of their API. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 2/2 [00:00<00:00, 18315.74it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is the id -> word mapping for the 'token_ids' namespace: \n",
      "{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'This', 3: 'movie', 4: 'was', 5: 'awful', 6: '!', 7: 'quite', 8: 'slow', 9: 'but', 10: 'good', 11: '.'} \n",
      "\n",
      "This is the id -> word mapping for the 'tags' namespace: \n",
      "{0: 'negative', 1: 'positive'} \n",
      "\n",
      "Vocab Token to Index dictionary:  defaultdict(None, {'token_ids': {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'This': 2, 'movie': 3, 'was': 4, 'awful': 5, '!': 6, 'quite': 7, 'slow': 8, 'but': 9, 'good': 10, '.': 11}, 'tags': {'negative': 0, 'positive': 1}}) \n",
      "\n",
      "Lengths used for padding:  {'review': {'num_tokens': 8}} \n",
      "\n",
      "{'review': {'tokens': Variable containing:\n",
      "    2     3     4     5     6     0     0     0\n",
      "    2     3     4     7     8     9    10    11\n",
      "[torch.LongTensor of size 2x8]\n",
      "}, 'label': Variable containing:\n",
      " 0\n",
      " 1\n",
      "[torch.LongTensor of size 2x1]\n",
      "}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from allennlp.data import Vocabulary \n",
    "from allennlp.data.dataset import Batch\n",
    "\n",
    "# This will automatically create a vocab from our dataset.\n",
    "# It will have \"namespaces\" which correspond to two things:\n",
    "# 1. Namespaces passed to fields (e.g. the \"tags\" namespace we passed to our LabelField)\n",
    "# 2. The keys of the 'Token Indexer' dictionary in 'TextFields'.\n",
    "# passed to Fields (so it will have a 'tags' namespace).\n",
    "vocab = Vocabulary.from_instances(instances)\n",
    "\n",
    "print(\"This is the id -> word mapping for the 'token_ids' namespace: \")\n",
    "print(vocab.get_index_to_token_vocabulary(\"token_ids\"), \"\\n\")\n",
    "print(\"This is the id -> word mapping for the 'tags' namespace: \")\n",
    "print(vocab.get_index_to_token_vocabulary(\"tags\"), \"\\n\")\n",
    "print(\"Vocab Token to Index dictionary: \", vocab._token_to_index, \"\\n\")\n",
    "# Note that the \"tags\" namespace doesn't contain padding or unknown tokens.\n",
    "\n",
    "# Next, we index our dataset using our newly generated vocabulary.\n",
    "# This modifies the current object. You must perform this step before \n",
    "# trying to generate arrays. \n",
    "batch = Batch(instances)\n",
    "batch.index_instances(vocab)\n",
    "\n",
    "# Finally, we return the dataset as arrays, padded using padding lengths\n",
    "# extracted from the dataset itself, which will be the max sentence length\n",
    "# from our two instances.\n",
    "padding_lengths = batch.get_padding_lengths()\n",
    "print(\"Lengths used for padding: \", padding_lengths, \"\\n\")\n",
    "tensor_dict = batch.as_tensor_dict(padding_lengths)\n",
    "print(tensor_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we've seen how to transform a dataset of 2 instances into arrays for feeding into an allennlp `Model`.  If you are iterating over a large number of `Instances`, such as during training, you may want to look into `allennlp.data.iterators`, which specify several different ways of iterating over a dataset in batches, such as fixed batch sizes, bucketing and stochastic sorting. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's been one thing we've left out of this tutorial so far - explaining the role of the `TokenIndexer` in `TextField`. We decided to introduce a new step into the typical `tokenisation -> indexing -> embedding` pipeline, because for more complicated encodings of words, such as those including character embeddings, this pipeline becomes difficult. Our pipeline contains the following steps: `tokenisation -> TokenIndexers -> TokenEmbedders -> TextFieldEmbedders`. \n",
    "\n",
    "The token indexer we used above is the most basic one - it assigns a single ID to each word in the `TextField`. This is classically what you might think of when indexing words. \n",
    "However, let's take a look at using a `TokenCharacterIndexer` as well - this takes the words in a `TextField` and generates indices for the characters in the words.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1it [00:00, 7037.42it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is the id -> word mapping for the 'token_ids' namespace: \n",
      "{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'Here', 3: 'are', 4: 'some', 5: 'longer', 6: 'words', 7: '.'} \n",
      "\n",
      "This is the id -> word mapping for the 'token_chars' namespace: \n",
      "{0: '@@PADDING@@', 1: '@@UNKNOWN@@', 2: 'e', 3: 'r', 4: 'o', 5: 's', 6: 'H', 7: 'a', 8: 'm', 9: 'l', 10: 'n', 11: 'g', 12: 'w', 13: 'd', 14: '.'} \n",
      "\n",
      "Lengths used for padding (Note that we now have a new padding key num_token_characters from the TokenCharactersIndexer): \n",
      "{'sentence': {'num_tokens': 6, 'num_token_characters': 6}} \n",
      "\n",
      "{'sentence': {'tokens': Variable containing:\n",
      " 2  3  4  5  6  7\n",
      "[torch.LongTensor of size 1x6]\n",
      ", 'chars': Variable containing:\n",
      "(0 ,.,.) = \n",
      "   6   2   3   2   0   0\n",
      "   7   3   2   0   0   0\n",
      "   5   4   8   2   0   0\n",
      "   9   4  10  11   2   3\n",
      "  12   4   3  13   5   0\n",
      "  14   0   0   0   0   0\n",
      "[torch.LongTensor of size 1x6x6]\n",
      "}}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from allennlp.data.token_indexers import TokenCharactersIndexer\n",
    "\n",
    "word_and_character_text_field = TextField(list(map(Token, [\"Here\", \"are\", \"some\", \"longer\", \"words\", \".\"])), \n",
    "                                          token_indexers={\"tokens\": SingleIdTokenIndexer(namespace=\"token_ids\"), \"chars\": TokenCharactersIndexer(namespace=\"token_chars\")})\n",
    "mini_dataset = Batch([Instance({\"sentence\": word_and_character_text_field})])\n",
    "\n",
    "# Fit a new vocabulary to this Field and index it:\n",
    "word_and_char_vocab = Vocabulary.from_instances(mini_dataset)\n",
    "mini_dataset.index_instances(word_and_char_vocab)\n",
    "\n",
    "print(\"This is the id -> word mapping for the 'token_ids' namespace: \")\n",
    "print(word_and_char_vocab.get_index_to_token_vocabulary(\"token_ids\"), \"\\n\")\n",
    "print(\"This is the id -> word mapping for the 'token_chars' namespace: \")\n",
    "print(word_and_char_vocab.get_index_to_token_vocabulary(\"token_chars\"), \"\\n\")\n",
    "\n",
    "\n",
    "# Now, the padding lengths method will find the max sentence length \n",
    "# _and_ max word length in the batch and pad all sentences to the max\n",
    "# sentence length and all words to the max word length.\n",
    "padding_lengths = mini_dataset.get_padding_lengths()\n",
    "print(\"Lengths used for padding (Note that we now have a new \"\n",
    "      \"padding key num_token_characters from the TokenCharactersIndexer): \")\n",
    "print(padding_lengths, \"\\n\")\n",
    "\n",
    "tensor_dict = mini_dataset.as_tensor_dict(padding_lengths)\n",
    "\n",
    "print(tensor_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now we've used a new token indexer, you can see that the `review` field of the returned dictionary now has 2 elements: `tokens`, a tensor representing the indexed tokens and `chars`, a tensor representing each word in the `TextField` as a list of character indices. Crucially, each list of integers for each word has been padded to the length of the maximum word in the sentence. Note that the keys for the dictionary of `token_indexers` for the `TextField` are different from the `namespaces`. This is because it's possible to re-use a `namespace` in different `TokenIndexers`. In the [Embedding Tokens notebook](embedding_tokens.ipynb) tutorial, we'll see that the keys allow us to create a 1-1 mapping between how tokens are _indexed_ with `TokenIndexers` and _embedded_ with `TokenEmbedders`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
